1 Executive Summary

Insert a concise (max 200 word) executive summary. It should be a clear, interesting summary of the main insights from the report.

2 Exploring the Dataset

  • Assess Data Provenance
  • Domain knowledge
  • Explore the data structure
  • Look for outliers and missing data

Background to report. Motor vehicle collisions are a major cause of death and injury in New York (ref). This report informs stakeholders of the geospatial patterns of motor vehicle collisions in NYC from 2012 to 2021 and summarises the initial data analysis. Stakeholders who may benefit from this research include the New York Police Department (NYPD), NYC Open Data, the NYC Government, motor vehicle manufacturers, car insurance companies, road safety professionals, NYC pedestrians, cyclists and motorists, and the general public.

Assessment of data provenance. The Motor Vehicle Collisions crash data (available at: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) was sourced from NYC Open Data and was provided by the New York Police Department (NYPD) for public safety purposes. The dataset is classified as free public data, and the NYC Open Data website includes thorough information on attribution, creation date, and the data generation process. The data was collected by police officers, who completed an MV-104AN report for every vehicle collision in NYC that resulted in human injuries or fatalities, or in at least $1000 worth of damage. From 1999 to 2016, only very basic data was collected because not all MV-104AN fields were entered. With the introduction of the Finest Online Records Management System (FORMS) in 2016, police officers began entering all MV-104AN fields using an electronic device. Data from before 2016 may therefore contain fewer collision details than data from after 2016. A further limitation of this dataset is the absence of data prior to 2012, which prevents us from analysing trends over a longer period. The data can be considered reliable because it was entered by police officers trained in the data collection process, although human error must still be considered. The dataset is updated daily, and an ‘MVCDataDictionary’ spreadsheet is available that records the revision history of the dataset. No changes have been documented on the spreadsheet thus far.
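
One way to check whether records from before 2016 really carry less detail is to compare the share of blank fields before and after the switch to FORMS. The sketch below runs on a small invented data frame rather than the real file. The column names `CRASH.DATE` and `CONTRIBUTING.FACTOR.VEHICLE.1`, the month/day/year date format, and the "Unspecified" placeholder value are assumptions about how the export is laid out, so they should be checked against the actual data before reuse.

```r
# Toy stand-in for the collision table. Column names and values are
# illustrative assumptions, not taken from the real file.
toy <- data.frame(
  CRASH.DATE = c("03/14/2014", "07/02/2014", "11/30/2015",
                 "01/05/2017", "06/21/2018", "09/09/2019"),
  CONTRIBUTING.FACTOR.VEHICLE.1 = c("", "Unspecified", "",
                                    "Driver Inattention/Distraction",
                                    "Failure to Yield Right-of-Way",
                                    ""),
  stringsAsFactors = FALSE
)

# Extract the year and label each record relative to the 2016 switch to FORMS
crash_year <- as.integer(format(as.Date(toy$CRASH.DATE, format = "%m/%d/%Y"), "%Y"))
toy$period <- ifelse(crash_year < 2016, "before 2016", "2016 onwards")

# Treat a blank or "Unspecified" entry as "no detail recorded"
toy$no_detail <- toy$CONTRIBUTING.FACTOR.VEHICLE.1 %in% c("", "Unspecified")

# Share of records without a contributing factor, by period
share_missing <- tapply(toy$no_detail, toy$period, mean)
print(round(share_missing, 2))
```

On the real table the same comparison could be repeated for each detail column to see which fields are affected.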

Domain knowledge. NYC is one of the busiest cities in the world, with a population of ~8,340,000 across 783.8 km². The city consists of five boroughs: Brooklyn, the Bronx, Manhattan, Queens and Staten Island. The Cross Bronx Expressway in NYC has been identified as the most congested road in the U.S. Other highly congested roads in NYC include sections of the George Washington Bridge, Fifth Avenue, and the Lincoln Tunnel. Previous reports indicated that pedestrians were ten times more likely than motorists to be killed in a crash, and that most serious crashes involved private vehicles. In terms of geospatial trends, pedestrian collisions were two-thirds deadlier on major roads than on smaller streets, and Manhattan had four times more pedestrians injured or killed than the other boroughs. For ethics and privacy purposes, the Motor Vehicle Collisions dataset does not reveal any confidential information about the individuals involved in the collisions.

Data structure. The Motor Vehicle Collisions crash table consists of 1.7 million rows and 29 columns. Each row corresponds to one motor vehicle collision, and the columns provide details of the crash, including the date, time, location (borough, zip code, latitude, longitude, street names), number of persons/pedestrians/cyclists/motorists killed or injured, contributing factors, and vehicle type.
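
The structure described above can be confirmed with a few base R calls. This is a minimal sketch on an invented four-column data frame. The column names follow the `read.csv` convention of replacing spaces with dots, which is an assumption for any column not shown elsewhere in this report.

```r
# Toy stand-in with a few of the columns described above.
# Values are invented for illustration.
toy <- data.frame(
  BOROUGH = c("BROOKLYN", "QUEENS", ""),
  LATITUDE = c(40.68, 40.73, NA),
  LONGITUDE = c(-73.94, -73.79, NA),
  NUMBER.OF.PERSONS.INJURED = c(0L, 2L, 0L),
  stringsAsFactors = FALSE
)

dim(toy)  # number of rows (collisions) and columns (crash details)
str(toy)  # storage type of each column

# Count values that are NA or blank in each column
n_missing <- sapply(toy, function(col) sum(is.na(col) | col == ""))
print(n_missing)
```

Run on the full table, `dim()` should report roughly 1.7 million rows and 29 columns, and the per-column counts show where blanks are concentrated.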

3 Cleaning

library(tidyverse) # piping `%>%`, plotting, reading data
library(skimr) # exploratory data summary
library(naniar) # exploratory plots
library(kableExtra) # tables
library(lubridate) # for date variables
library(plotly) # interactive plots
nyc <- read.csv("MVC.csv")
#nyc %>% glimpse()
#nyc %>% summary()
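# Drop collisions with a blank or zero location; (0, 0) cannot be a valid NYC coordinate
cleannyc <- nyc[!(nyc$LONGITUDE == "" | nyc$LATITUDE == "" | nyc$LOCATION == "" | nyc$LATITUDE == 0 | nyc$LONGITUDE == 0),]
#cleannyc %>% glimpse()
#table(is.na(cleannyc))
# Drop any remaining rows that contain NA in any column
cleannyc <- na.omit(cleannyc)
#vis_miss(cleannyc, warn_large_data = FALSE)
# Boxplots of the eight injured/killed count columns; widen the left margin first so the labels fit
#par(mar = c(5, 10, 4, 2) + 0.1)
#boxplot(cleannyc[, 11:18], cex.axis = 0.6, las = 1, horizontal = TRUE)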
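
Because the cleaning steps discard rows, it can be useful to record how many rows each step removes. The following is a minimal sketch on an invented data frame, not the code used for this report. It checks for NA explicitly in the location filter, which is a variation on the filter above, and the toy values are assumptions.

```r
# Toy stand-in for the collision table. Values are invented.
toy <- data.frame(
  LATITUDE  = c(40.68, 0, NA, 40.73, 40.70),
  LONGITUDE = c(-73.94, 0, NA, -73.79, -73.99),
  NUMBER.OF.PERSONS.INJURED = c(0L, 1L, 0L, NA, 2L)
)

# Step 1: keep rows with a usable location (not NA and not zero)
has_location <- !is.na(toy$LATITUDE) & !is.na(toy$LONGITUDE) &
  toy$LATITUDE != 0 & toy$LONGITUDE != 0
step1 <- toy[has_location, ]

# Step 2: drop rows with NA in any remaining column
step2 <- na.omit(step1)

# Rows removed at each step, and rows kept
removed <- c(no_location = nrow(toy) - nrow(step1),
             other_na    = nrow(step1) - nrow(step2),
             kept        = nrow(step2))
print(removed)
```

Reporting these counts alongside the cleaned data makes it clear how much of the original table the later analysis rests on.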

As the boxplot above shows, there are many outliers, especially in the number of motorists injured and the number of persons injured. On inspection, the maximum number of motorists injured was recorded on 9/9/2013: a bus accident in Brooklyn in which a car collided head-on with a bus, leaving 43 people injured. The same entry is the largest outlier in the number of persons injured. It appears in both the persons and motorist categories because the 43 people were on the bus and were therefore classified as motorists. Without inspection, these outliers may seem extraordinary and could be mistaken for faulty data collection, but on closer examination they are valid and an important part of our data analysis. In fact, relative to the mean and median of these columns, most of the circles shown in the graph are flagged as outliers. The median and mean are both close to 0 injuries or deaths per collision, which is expected: the dataset has a very large number of entries, and the probability of being injured or killed in any one collision is relatively small. The distributions are therefore heavily right-skewed, and we cannot treat every data point greater than 0 as an outlier.

#max(cleannyc$NUMBER.OF.MOTORIST.INJURED)
#cleannyc %>% filter(NUMBER.OF.MOTORIST.INJURED == 43)
#max(cleannyc$NUMBER.OF.PERSONS.INJURED)
#cleannyc %>% filter(NUMBER.OF.PERSONS.INJURED == 43)
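
The reason almost every non-zero count is drawn as a circle follows from the boxplot rule itself: points further than 1.5 times the interquartile range beyond the quartiles are flagged, and when at least three quarters of collisions record zero injuries both quartiles are 0. A minimal sketch on invented counts (the vector `injured` is illustrative and not taken from the dataset):

```r
# Invented injury counts for 100 collisions: mostly zero, one extreme value
injured <- c(rep(0, 90), rep(1, 7), rep(2, 2), 43)

# Quartiles and the upper fence used to flag outliers
q <- quantile(injured, c(0.25, 0.75))
upper_fence <- q[["75%"]] + 1.5 * IQR(injured)

# boxplot.stats() applies the 1.5 * IQR rule (based on hinges) that
# boxplot() uses to decide which points are drawn as circles
flagged <- boxplot.stats(injured)$out

print(c(upper_fence = upper_fence,
        n_flagged   = length(flagged),
        n_nonzero   = sum(injured > 0)))
```

With both quartiles at 0 the fence sits at 0, so every collision with at least one injury is flagged, regardless of whether it is unusual in any practical sense.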

Even though these outliers are valid, they affect our aggregate statistics by pulling the mean higher than is typical. For this reason the median is a better summary than the mean, as it is far less affected by high outliers. Low outliers are not a concern because the counts cannot fall below 0. The graph below shows these two outliers and their effect in more detail.

#fig = plot_ly(y = cleannyc$NUMBER.OF.PERSONS.INJURED, type = "box", name = "Number of Persons Injured")
#fig = fig %>% add_trace(y = cleannyc$NUMBER.OF.MOTORIST.INJURED, name = "Number of Motorists Injured") %>% layout(title = "Persons Injured and Motorist Injured Outlier Analysis")
#fig
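
The pull of a single extreme record on the mean, compared with the median, can be illustrated by computing both with and without the largest value. This is a sketch on invented counts; the vector `injured` is an assumption for illustration and does not reproduce the real column.

```r
# Invented injury counts for 100 collisions, including one record of 43
injured <- c(rep(0, 90), rep(1, 7), rep(2, 2), 43)

# Same data with the single largest record removed
without_max <- injured[injured != max(injured)]

# Mean and median with and without the extreme record
summary_tbl <- rbind(
  with_outlier    = c(mean = mean(injured),     median = median(injured)),
  without_outlier = c(mean = mean(without_max), median = median(without_max))
)
print(round(summary_tbl, 3))
```

In this toy example one record out of 100 changes the mean several-fold, while the median is unchanged.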

6 Reflection on Data Wrangling

Insert your reflection on how data wrangling helped you explore your research questions. (Don’t forget to adjust information at the top of report regarding your name in the author field etc!!)